{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "13626594",
   "metadata": {},
   "source": [
    "### 梯度增强（Gradient Boost）\n",
    "梯度增强是一种通过循环迭代将模型添加到集合中的方法。<br/>\n",
    "它首先用一个模型初始化集合，这个模型的预测可能非常简单。(即使它的预测非常不准确，后续添加的集合将解决这些错误。)<br/>\n",
    "然后，我们开始循环迭代：\n",
    "\n",
    "###### ·首先，我们使用当前集成来为数据集中的每个观测结果生成预测。为了进行预测，我们将所有模型的预测添加到集成中。<br/>\n",
    "###### ·这些预测被用来计算损失函数(例如，平均平方误差)。<br/>\n",
    "###### ·然后，我们使用损失函数来适应一个新的模型，这个模型将被添加到集成中。具体地说，我们确定模型参数，以便将这个新模型添加到集成中来减少损失。(注:“梯度推进”中的“梯度”指的是我们将对损失函数使用梯度下降法来确定新模型中的参数。)<br/>\n",
    "###### ·最后，我们将新的模型加入到集成中，并且重复…<br/>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "183fce86",
   "metadata": {},
   "source": [
    "#### XGBoost是极端梯度增强(eXtreme Gradient Boosting)的缩写，它是梯度增强的一种实现"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "36b24226",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Rooms</th>\n",
       "      <th>Distance</th>\n",
       "      <th>Landsize</th>\n",
       "      <th>BuildingArea</th>\n",
       "      <th>YearBuilt</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3229</th>\n",
       "      <td>4</td>\n",
       "      <td>10.5</td>\n",
       "      <td>481.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4341</th>\n",
       "      <td>2</td>\n",
       "      <td>2.3</td>\n",
       "      <td>0.0</td>\n",
       "      <td>64.0</td>\n",
       "      <td>1970.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8566</th>\n",
       "      <td>3</td>\n",
       "      <td>13.8</td>\n",
       "      <td>577.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9967</th>\n",
       "      <td>3</td>\n",
       "      <td>31.6</td>\n",
       "      <td>2385.0</td>\n",
       "      <td>127.0</td>\n",
       "      <td>1930.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3041</th>\n",
       "      <td>3</td>\n",
       "      <td>13.7</td>\n",
       "      <td>495.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      Rooms  Distance  Landsize  BuildingArea  YearBuilt\n",
       "3229      4      10.5     481.0           NaN        NaN\n",
       "4341      2       2.3       0.0          64.0     1970.0\n",
       "8566      3      13.8     577.0           NaN        NaN\n",
       "9967      3      31.6    2385.0         127.0     1930.0\n",
       "3041      3      13.7     495.0           NaN        NaN"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#加载训练和验证数据X_train、X_valid、y_train和y_valid\n",
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# 读取数据\n",
    "data = pd.read_csv('../melb_data.csv')\n",
    "\n",
    "# 选择预测子数据集\n",
    "cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']\n",
    "X = data[cols_to_use]\n",
    "\n",
    "# Select target\n",
    "y = data.Price\n",
    "\n",
    "# 将数据拆分为训练和验证数据\n",
    "X_train, X_valid, y_train, y_valid = train_test_split(X, y)\n",
    "# 用head()方法瞄一眼X_train\n",
    "X_train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9295904e",
   "metadata": {},
   "source": [
    "#### 导入用于scikit-learn API的XGBoost (xgboost.XGBRegressor)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "47aca74d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n",
       "             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,\n",
       "             importance_type='gain', interaction_constraints='',\n",
       "             learning_rate=0.300000012, max_delta_step=0, max_depth=6,\n",
       "             min_child_weight=1, missing=nan, monotone_constraints='()',\n",
       "             n_estimators=100, n_jobs=12, num_parallel_tree=1, random_state=0,\n",
       "             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,\n",
       "             tree_method='exact', validate_parameters=1, verbosity=None)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# anaconda 安装xgboost ：conda install -c anaconda py-xgboost\n",
    "from xgboost import XGBRegressor\n",
    "\n",
    "my_model = XGBRegressor()\n",
    "my_model.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "0d5f9480",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean Absolute Error: 239548.9475285346\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import mean_absolute_error\n",
    "\n",
    "predictions = my_model.predict(X_valid)\n",
    "print(\"Mean Absolute Error: \" + str(mean_absolute_error(predictions, y_valid)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "56932857",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n",
       "             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,\n",
       "             importance_type='gain', interaction_constraints='',\n",
       "             learning_rate=0.300000012, max_delta_step=0, max_depth=6,\n",
       "             min_child_weight=1, missing=nan, monotone_constraints='()',\n",
       "             n_estimators=500, n_jobs=12, num_parallel_tree=1, random_state=0,\n",
       "             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,\n",
       "             tree_method='exact', validate_parameters=1, verbosity=None)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_model = XGBRegressor(n_estimators=500)\n",
    "my_model.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "1c3e1005",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n",
       "             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,\n",
       "             importance_type='gain', interaction_constraints='',\n",
       "             learning_rate=0.300000012, max_delta_step=0, max_depth=6,\n",
       "             min_child_weight=1, missing=nan, monotone_constraints='()',\n",
       "             n_estimators=500, n_jobs=12, num_parallel_tree=1, random_state=0,\n",
       "             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,\n",
       "             tree_method='exact', validate_parameters=1, verbosity=None)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_model = XGBRegressor(n_estimators=500)\n",
    "my_model.fit(X_train, y_train, \n",
    "             early_stopping_rounds=5, \n",
    "             eval_set=[(X_valid, y_valid)],\n",
    "             verbose=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "380dad84",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n",
       "             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,\n",
       "             importance_type='gain', interaction_constraints='',\n",
       "             learning_rate=0.05, max_delta_step=0, max_depth=6,\n",
       "             min_child_weight=1, missing=nan, monotone_constraints='()',\n",
       "             n_estimators=1000, n_jobs=12, num_parallel_tree=1, random_state=0,\n",
       "             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,\n",
       "             tree_method='exact', validate_parameters=1, verbosity=None)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)\n",
    "my_model.fit(X_train, y_train, \n",
    "             early_stopping_rounds=5, \n",
    "             eval_set=[(X_valid, y_valid)], \n",
    "             verbose=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "1a9a88ea",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n",
       "             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,\n",
       "             importance_type='gain', interaction_constraints='',\n",
       "             learning_rate=0.05, max_delta_step=0, max_depth=6,\n",
       "             min_child_weight=1, missing=nan, monotone_constraints='()',\n",
       "             n_estimators=1000, n_jobs=4, num_parallel_tree=1, random_state=0,\n",
       "             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,\n",
       "             tree_method='exact', validate_parameters=1, verbosity=None)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)\n",
    "my_model.fit(X_train, y_train, \n",
    "             early_stopping_rounds=5, \n",
    "             eval_set=[(X_valid, y_valid)], \n",
    "             verbose=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "cc4650be",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean Absolute Error: 249005.4734996318\n"
     ]
    }
   ],
   "source": [
    "# from sklearn.metrics import mean_absolute_error\n",
    "\n",
    "predictions = my_model.predict(X_valid)\n",
    "print(\"Mean Absolute Error: \" + str(mean_absolute_error(predictions, y_valid)))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
